Abstract —In this paper we address the problem of performing
statistical inference for large scale data sets, i.e., Big Data. The
volume and dimensionality of the data may be so high that
it cannot be processed or stored in a single computing node.
We propose a scalable, statistically robust and computationally
efficient bootstrap method, compatible with distributed processing
and storage systems. Bootstrap resamples are constructed
with a smaller number of distinct data points on multiple disjoint
subsets of data, similarly to the bag of little bootstraps method
(BLB) [1]. Significant savings in computation are then achieved by
avoiding the recomputation of the estimator for each bootstrap
sample. Instead, a computationally efficient fixed-point estimation
equation is analytically solved via a smart approximation
following the Fast and Robust Bootstrap method (FRB) [2].
Our proposed bootstrap method facilitates the use of highly
robust statistical methods in analyzing large scale data sets. The
favorable statistical properties of the method are established
analytically. Numerical examples demonstrate scalability, low
complexity and robust statistical performance of the method in
analyzing large data sets.

Index Terms —bootstrap, bag of little bootstraps, fast and
robust bootstrap, big data, robust estimation, distributed
computation.

I. Introduction 

Recent advances in digital technology have led to a
proliferation of large scale data sets. Examples include
climate data, social networking, smart phone and health data.
Inferential statistical analysis of such large scale data sets
is crucial in order to quantify the statistical correctness of
parameter estimates and to test hypotheses. However, the volume of
the data has grown to an extent that cannot be effectively
handled by traditional statistical analysis and inferential methods.
Processing and storage of massive data sets becomes possible
through parallel and distributed architectures. Performing
statistical inference on massive data sets using distributed and
parallel platforms requires fundamental changes in statistical
methodology. Even estimation of a parameter of interest based
on the entire massive data set can be prohibitively expensive.
In addition, assigning estimates of uncertainty (error bars,
confidence intervals, etc.) to the point estimates is not
computationally feasible using conventional statistical inference
methodology such as the bootstrap [3].

The bootstrap method is known as a consistent method of
assigning estimates of uncertainty (e.g., standard deviation,
confidence intervals, etc.) to statistical estimates [3], [4] and
it is commonly applied in the field of signal processing [5],
[6]. However, for at least two obvious reasons the method
is computationally impractical for analysis of modern high-volume
and high-dimensional data sets. First, the size of each
bootstrap sample is the same as the original big data set (with
about 63% of data points appearing at least once in each
sample typically), thus leading to processing and storage problems
even in advanced computing systems. Second, (re)computation
of the value of the estimator for each massive bootstrapped data
set is not feasible even for estimators with a moderate level of
computational complexity. Variants such as subsampling [7]
and the m out of n bootstrap [8] were proposed to reduce
the computational cost of the bootstrap by computing the
point estimates on smaller subsamples of the original data set.
Implementation of such methods is even more problematic as
the output is sensitive to the size of the subsamples m. In
addition, extra analytical effort is needed in order to re-scale
the output to the right size.
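
The figure of about 63% quoted above follows from a standard calculation,
sketched here for completeness (it is not taken from the paper itself). A
bootstrap sample of size $n$ is drawn with replacement, so a given data point
is missed by every one of the $n$ draws with probability $(1 - 1/n)^n$:

```latex
\Pr\{\text{a given point appears in a bootstrap sample}\}
  = 1 - \left(1 - \frac{1}{n}\right)^{n}
  \;\xrightarrow[n \to \infty]{}\; 1 - e^{-1} \approx 0.632 .
```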

The bag of little bootstraps (BLB) [1] modifies the conventional
bootstrap to make it applicable for massive data sets. In the
BLB method the massive data is subdivided randomly into
disjoint subsets (i.e., so-called subsample modules or bags). This
allows the massive data sets to be stored in distributed fashion.
Moreover, subsample modules can be processed in parallel
using distributed computing architectures. The BLB samples
are constructed by assigning random weights from a multinomial
distribution to the data points of a disjoint subsample.
Although BLB alleviates the problem of handling and processing
massive bootstrap samples, (re)computation
of the estimates for a large number of bootstrap samples
is still prohibitively expensive. Thus, on the one hand BLB is
impractical for many commonly used modern estimators that
typically have a high complexity. Such estimators often require
solving demanding optimization problems numerically. On the
other hand, using the primitive least squares (LS) estimator in the
original BLB scheme does not provide a statistically robust bootstrap
procedure as the LS estimator is known to be very sensitive
to outliers.

In this paper we address the problem of bootstrapping
massive data sets by introducing a low complexity and robust
bootstrap method. The new method possesses a scalability
property similar to that of the BLB scheme, with significantly lower
computational complexity. Low complexity is achieved by
utilizing for each subset a fast fixed-point estimation technique
stemming from the Fast and Robust Bootstrap (FRB) method [2],
[9], [10]. It avoids (re)computation of fixed-point equations for
each bootstrap sample via a smart approximation. Although
the FRB method has a lower complexity than
the conventional bootstrap, the original FRB is incompatible
with distributed processing and storage platforms and
is not suitable for bootstrap analysis of massive data sets.
Our proposed bootstrap method is scalable and compatible
with distributed computing architectures and storage systems,
robust to outliers, and consistently provides accurate results at
a much faster rate than the original BLB method. We note
that some preliminary results of the proposed approach were
presented in the conference paper [11].

The paper is organized as follows. In Section II, the BLB
and FRB methods are reviewed. The new bootstrap scheme
(BLFRB) is proposed in Section III, followed by an implementation
of the method for the MM-estimator of regression [12]. In
Section IV, consistency and statistical robustness of the new
method are discussed. Section V provides simulation studies
and an example of using the new method for analysis of a real
world big data set. Section VI concludes.


II. Related bootstrap methods 

In this section, we briefly describe the ideas of the BLB [1]
and FRB [2] methods. The pros and cons of both methods are
discussed as well.


A. Bag of Little Bootstraps 


Let $\mathbf{X} = (\mathbf{x}_1 \cdots \mathbf{x}_n) \in \mathbb{R}^{d \times n}$ be a $d$ dimensional observed
data set of size $n$. The volume and dimensionality of the data
may be so high that it cannot be processed or stored in a
single node. Consider $\hat{\theta}_n \in \mathbb{R}^m$ as an estimator of a parameter
of interest $\theta \in \mathbb{R}^m$ based on $\mathbf{X}$. Computation of an estimate of
uncertainty $\hat{\xi}^*$ (e.g., confidence intervals, standard deviation,
etc.) for $\hat{\theta}_n$ is of great interest, as for large data sets confidence
intervals are often more informative than plain point estimates.

The bag of little bootstraps (BLB) [1] is a scalable bootstrap
scheme that draws disjoint subsamples
$\check{\mathbf{X}} = (\check{\mathbf{x}}_1 \cdots \check{\mathbf{x}}_b) \in \mathbb{R}^{d \times b}$
(which form "bags" or "modules") of smaller size
$b = \{\lfloor n^{\gamma} \rfloor \mid \gamma \in [0.6, 0.9]\}$ by randomly resampling without
replacement from columns of $\mathbf{X}$. For example, if $n = 10^7$ and
$\gamma = 0.6$, then $b = \lfloor 10^{4.2} \rfloor = 15848$. For each subsample module,
bootstrap samples $\mathbf{X}^*$ are generated by assigning a random weight
vector $\mathbf{n}^* = (n_1^*, \ldots, n_b^*)$ from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to the
data points of the subsample, where the weights sum to $n$.
The desired estimate of uncertainty $\hat{\xi}^*$ is computed based on
the population of $\hat{\theta}_n^*$ within each subsample module and the final
estimate is obtained by averaging the $\hat{\xi}^*$'s over the modules.
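
The resampling step described above can be sketched in a few lines of Python
with NumPy. This is a minimal illustration under stated assumptions, not the
authors' implementation: the function names `subsample_size`,
`draw_disjoint_bags` and `draw_blb_weights` are hypothetical, and the values of
`n`, `gamma` and `s` are simply those used later in the simulation study.

```python
import numpy as np

def subsample_size(n: int, gamma: float) -> int:
    """Bag size b = floor(n**gamma), with gamma typically in [0.6, 0.9]."""
    return int(np.floor(n ** gamma))

def draw_disjoint_bags(n: int, b: int, s: int, rng: np.random.Generator) -> np.ndarray:
    """Column indices of s disjoint bags of size b (sampling without replacement)."""
    if s * b > n:
        raise ValueError("s * b must not exceed n for disjoint bags")
    return rng.permutation(n)[: s * b].reshape(s, b)

def draw_blb_weights(n: int, b: int, rng: np.random.Generator) -> np.ndarray:
    """One weight vector n* ~ Multinomial(n, (1/b) 1_b); its entries sum to n."""
    return rng.multinomial(n, np.full(b, 1.0 / b))

rng = np.random.default_rng(0)
n, gamma, s = 50_000, 0.7, 25
b = subsample_size(n, gamma)             # number of distinct points per bag
bags = draw_disjoint_bags(n, b, s, rng)  # shape (s, b)
weights = draw_blb_weights(n, b, rng)    # resample of nominal size n on b points
```

Each bootstrap sample is represented only by its `b` counts, so a resample of
nominal size `n` never has to be materialized.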

In the BLB scheme each bootstrap sample contains at
most $b$ distinct data points. Thus the BLB approach produces
the bootstrap replicas with reduced effort in comparison to the
conventional bootstrap [3]. Furthermore, the computation for
each subsample can be done in parallel by different computing
nodes. Nevertheless, (re)computing the value of the estimator for
each bootstrap sample, for example thousands of times, is still
computationally impractical even for estimators of a moderate
level of complexity. This includes a wide range of modern
estimators that are solutions to optimization problems, such as
maximum likelihood methods or highly robust estimators of
linear regression. The BLB method was originally introduced
with the primitive LS estimator. Such a combination does not
provide a statistically robust bootstrap procedure as the LS
estimator is known to be very sensitive to outliers.
Later, in Section IV of this paper, we show that even one
outlying data point is sufficient to break down the BLB results.


B. Fast and Robust Bootstrap 

The fast and robust bootstrap (FRB) method [2], [9], [10] is
computationally efficient and robust to outliers in comparison
with the conventional bootstrap. It is applicable for estimators
$\hat{\theta}_n \in \mathbb{R}^m$ that can be expressed as a solution to a system
of smooth fixed-point (FP) equations:

$$\hat{\theta}_n = \mathbf{Q}(\hat{\theta}_n; \mathbf{X}), \tag{1}$$

where $\mathbf{Q}: \mathbb{R}^m \to \mathbb{R}^m$. The bootstrap replicated estimator $\hat{\theta}_n^*$
then solves

$$\hat{\theta}_n^* = \mathbf{Q}(\hat{\theta}_n^*; \mathbf{X}^*), \tag{2}$$

where the function $\mathbf{Q}$ is the same as in (1) but now dependent
on the bootstrap sample $\mathbf{X}^*$. Then, instead of computing $\hat{\theta}_n^*$
from (2), we compute:

$$\hat{\theta}_n^{1*} = \mathbf{Q}(\hat{\theta}_n; \mathbf{X}^*), \tag{3}$$

where the notation $\hat{\theta}_n^{1*}$ denotes an approximation of $\hat{\theta}_n^*$ in
(2) with initial value $\hat{\theta}_n$ based on the bootstrap sample $\mathbf{X}^*$. In
fact, $\hat{\theta}_n^{1*}$ is a one-step improvement of the initial estimate.
In the conventional bootstrap, one uses the distribution of $\hat{\theta}_n^*$ to
estimate the sampling distribution of $\hat{\theta}_n$. Since the distribution
of the one-step estimator $\hat{\theta}_n^{1*}$ does not accurately reflect the
sampling variability of $\hat{\theta}_n$, but typically underestimates it, a
linear correction needs to be applied as follows:

$$\hat{\theta}_n^{R*} = \hat{\theta}_n + \left[\mathbf{I} - \nabla\mathbf{Q}(\hat{\theta}_n; \mathbf{X})\right]^{-1} (\hat{\theta}_n^{1*} - \hat{\theta}_n), \tag{4}$$

where $\nabla\mathbf{Q}(\cdot) \in \mathbb{R}^{m \times m}$ is the matrix of partial derivatives w.r.t.
$\hat{\theta}_n$. Then, under sufficient regularity conditions, $\hat{\theta}_n^{R*}$ will be
estimating the limiting distribution of $\hat{\theta}_n$. In most applications,
$\hat{\theta}_n^{R*}$ is not only significantly faster to compute than $\hat{\theta}_n^*$,
but numerically more stable and statistically robust as well.
However, the original FRB is not scalable or compatible with
distributed storage and processing systems. Hence, it is not
suited for bootstrap analysis of massive data sets. The method
has been applied to many complex fixed-point estimators such
as the FastICA estimator [13], PCA and highly robust estimators
of linear regression [12].
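
The linear correction in (4) can be motivated by a first-order Taylor
expansion. The following is a sketch of the standard argument, not a derivation
taken from this paper; it writes $\hat{\theta}_n^{1*} = \mathbf{Q}(\hat{\theta}_n; \mathbf{X}^*)$ for the
one-step estimate and assumes that the derivative matrix evaluated on the
bootstrap sample is close to $\nabla\mathbf{Q}(\hat{\theta}_n; \mathbf{X})$. Expanding the right-hand
side of (2) around $\hat{\theta}_n$ gives:

```latex
\begin{align*}
\hat{\theta}_n^{*} = \mathbf{Q}(\hat{\theta}_n^{*};\mathbf{X}^{*})
  &\approx \mathbf{Q}(\hat{\theta}_n;\mathbf{X}^{*})
   + \nabla\mathbf{Q}(\hat{\theta}_n;\mathbf{X})\,(\hat{\theta}_n^{*}-\hat{\theta}_n) \\
\Rightarrow\quad
\left[\mathbf{I}-\nabla\mathbf{Q}(\hat{\theta}_n;\mathbf{X})\right](\hat{\theta}_n^{*}-\hat{\theta}_n)
  &\approx \hat{\theta}_n^{1*}-\hat{\theta}_n \\
\Rightarrow\quad
\hat{\theta}_n^{*}
  &\approx \hat{\theta}_n
   + \left[\mathbf{I}-\nabla\mathbf{Q}(\hat{\theta}_n;\mathbf{X})\right]^{-1}
     (\hat{\theta}_n^{1*}-\hat{\theta}_n)
   = \hat{\theta}_n^{R*} .
\end{align*}
```

The second line is obtained by subtracting $\hat{\theta}_n$ from both sides of the first.
The correction matrix depends only on the original sample, which is why it can
be computed once and reused for every bootstrap replicate.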


III. Fast and Robust Bootstrap for Big Data 

In this section we propose a new bootstrap method that
combines the desirable properties of the BLB and FRB methods.
The method can be applied to any estimator representable
as smooth FP equations. The developed Bag of Little Fast
and Robust Bootstraps (BLFRB) method is suitable for big
data analysis because of its scalability and low computational
complexity. Recall that the main computational burden of the
BLB scheme is in recomputation of the estimating equation (2) for
each bootstrap sample $\mathbf{X}^*$. Such computational complexity can
be drastically reduced by computing the FRB replications as
in (4) instead. This can be done locally within each bag. Let
$\hat{\theta}_{n,b}$ be a solution to equation (1) for subsample $\check{\mathbf{X}} \in \mathbb{R}^{d \times b}$:

$$\hat{\theta}_{n,b} = \mathbf{Q}(\hat{\theta}_{n,b}; \check{\mathbf{X}}). \tag{5}$$

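Let $\mathbf{X}^* \in \mathbb{R}^{d \times n}$ be a bootstrap sample of size $n$ randomly
resampled with replacement from the disjoint subset $\check{\mathbf{X}}$ of size
$b$, or equivalently generated by assigning a random weight
vector $\mathbf{n}^* = (n_1^*, \ldots, n_b^*)$ from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to
the data points of $\check{\mathbf{X}}$. The FRB replication of $\hat{\theta}_{n,b}$ can be obtained
by

$$\hat{\theta}_{n,b}^{R*} = \hat{\theta}_{n,b} + \left[\mathbf{I} - \nabla\mathbf{Q}(\hat{\theta}_{n,b}; \check{\mathbf{X}})\right]^{-1} (\hat{\theta}_{n,b}^{1*} - \hat{\theta}_{n,b}), \tag{6}$$

where $\hat{\theta}_{n,b}^{1*} = \mathbf{Q}(\hat{\theta}_{n,b}; \mathbf{X}^*)$ is the one-step estimator and
$\nabla\mathbf{Q}(\cdot) \in \mathbb{R}^{m \times m}$ is the matrix of partial derivatives w.r.t.
$\hat{\theta}_{n,b}$. The proposed BLFRB procedure is given in detail in
Algorithm 1. The steps of the algorithm are illustrated in Fig. 1,
where $k = 1, \ldots, s$ denotes the disjoint subsamples
and the superscript $(kj)$ corresponds to the $j$th bootstrap sample generated
from the distinct subsample $k$. Note that the terms $\hat{\theta}_{n,b}$ and
$[\mathbf{I} - \nabla\mathbf{Q}(\hat{\theta}_{n,b}; \check{\mathbf{X}})]^{-1}$ are computed only once for each bag.

Algorithm 1: The BLFRB procedure

1 Draw $s$ subsamples (which form "bags" or "modules")
  $\check{\mathbf{X}} = (\check{\mathbf{x}}_1 \cdots \check{\mathbf{x}}_b)$ of smaller size
  $b = \{\lfloor n^{\gamma} \rfloor \mid \gamma \in [0.6, 0.9]\}$ by randomly sampling without
  replacement from columns of $\mathbf{X}$;
  for each subsample $\check{\mathbf{X}}$ do
2   Generate $r$ bootstrap samples by resampling as
    follows: Bootstrap sample $\mathbf{X}^* = (\check{\mathbf{X}}; \mathbf{n}^*)$ is formed
    by assigning a random weight vector
    $\mathbf{n}^* = (n_1^*, \ldots, n_b^*)$ from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to
    columns of $\check{\mathbf{X}}$;
3   Find the initial estimate $\hat{\theta}_{n,b}$ that solves (5) and for
    each bootstrap sample $\mathbf{X}^*$ compute $\hat{\theta}_{n,b}^{R*}$ from
    equation (6);
4   Compute the desired estimate of uncertainty $\hat{\xi}^*$ based
    on the population of $r$ FRB replicated values $\hat{\theta}_{n,b}^{R*}$;
  end
5 Average the computed values of the estimate of
  uncertainty over the subsamples, i.e., $\hat{\xi}^* = \frac{1}{s}\sum_{k=1}^{s} \hat{\xi}^{*(k)}$.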
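
As an illustration of Algorithm 1, the sketch below applies the BLFRB steps to
a deliberately simple fixed-point estimator: a univariate location M-estimate
with Tukey biweight weights, for which $\mathbf{Q}$ is a weighted mean. Several
choices here are assumptions made for the example and are not from the paper:
the scale is fixed at the bag's MAD, the derivative of $\mathbf{Q}$ is
approximated by central differences, and the names `tukey_weight`, `Q` and
`blfrb_sd` are hypothetical.

```python
import numpy as np

def tukey_weight(u, c=4.685):
    """Tukey biweight weight w(u) = (1 - (u/c)^2)^2 for |u| <= c, else 0."""
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def Q(theta, x, counts, scale):
    """Fixed-point map of a location M-estimate: theta = sum(w x) / sum(w)."""
    w = counts * tukey_weight((x - theta) / scale)
    return np.sum(w * x) / np.sum(w)

def blfrb_sd(x, s, gamma, r, rng):
    """BLFRB estimate of the standard deviation of the location M-estimate."""
    n = x.size
    b = int(np.floor(n ** gamma))
    assert s * b <= n, "bags must be disjoint"
    bags = rng.permutation(n)[: s * b].reshape(s, b)        # step 1
    ones = np.ones(b)
    sd_per_bag = []
    for idx in bags:
        xb = x[idx]
        med = np.median(xb)
        scale = 1.4826 * np.median(np.abs(xb - med))        # MAD, held fixed
        # Step 3a: solve the fixed-point equation (5) once per bag.
        theta_b = med
        for _ in range(200):
            new = Q(theta_b, xb, ones, scale)
            done = abs(new - theta_b) < 1e-12 * scale
            theta_b = new
            if done:
                break
        # Correction factor [I - grad Q]^{-1}, also computed once per bag.
        h = 1e-5 * scale
        grad = (Q(theta_b + h, xb, ones, scale) - Q(theta_b - h, xb, ones, scale)) / (2 * h)
        corr = 1.0 / (1.0 - grad)
        # Step 2: r multinomial weight vectors, shape (r, b).
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=r)
        # Step 3b: one-step estimates in closed form, then correction (6).
        w0 = tukey_weight((xb - theta_b) / scale)
        one_step = (counts @ (w0 * xb)) / (counts @ w0)
        replicas = theta_b + corr * (one_step - theta_b)
        sd_per_bag.append(replicas.std(ddof=1))             # step 4
    return float(np.mean(sd_per_bag))                       # step 5

rng = np.random.default_rng(1)
x = rng.standard_normal(20_000)
sd_hat = blfrb_sd(x, s=5, gamma=0.7, r=100, rng=rng)
```

No fixed-point iteration is run inside the bootstrap loop: each replicate costs
one weighted sum over the `b` distinct points of the bag.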

While the BLFRB procedure inherits the scalability of BLB,
it is radically faster to compute, since the replication $\hat{\theta}_{n,b}^{R*}$ can
be computed in closed form with a small number of distinct
data points. The low complexity of the BLFRB scheme allows
for fast and scalable computation of confidence intervals
for commonly used modern fixed-point estimators such as the
FastICA estimator [13], PCA and highly robust estimators of
linear regression [12].


A. BLFRB for MM-estimator of linear regression 

Here we present a practical example formulation of the
method, where the proposed BLFRB method is used for
linear regression. In order to construct a statistically robust
bootstrap method, an MM-estimator that lends itself to fixed-point
estimation equations is employed for bootstrap replicas. Let
$\mathbf{X} = \{(y_1, \mathbf{z}_1^\top)^\top, \ldots, (y_n, \mathbf{z}_n^\top)^\top\}$, $\mathbf{z}_i \in \mathbb{R}^p$, be a sample of
independent random vectors that follow the linear model:

$$y_i = \mathbf{z}_i^\top \theta + \sigma_0 e_i \quad \text{for } i = 1, \ldots, n, \tag{7}$$

where $\theta \in \mathbb{R}^p$ is the unknown parameter vector. The noise terms
$e_i$ are i.i.d. random variables from a symmetric distribution
with unit scale.

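Highly robust MM-estimators are based on two loss
functions $\rho_0: \mathbb{R} \to \mathbb{R}^+$ and $\rho_1: \mathbb{R} \to \mathbb{R}^+$ which determine the
breakdown point and efficiency of the estimator, respectively.
The $\rho_0(\cdot)$ and $\rho_1(\cdot)$ functions are symmetric, twice continuously
differentiable with $\rho(0) = 0$, strictly increasing on $[0, c]$
and constant on $[c, \infty)$ for some constant $c$. The MM-estimate
$\hat{\theta}_n$ satisfies

$$\frac{1}{n} \sum_{i=1}^{n} \rho_1'\!\left(\frac{y_i - \mathbf{z}_i^\top \hat{\theta}_n}{\hat{\sigma}_n}\right) \mathbf{z}_i = \mathbf{0}, \tag{8}$$

where $\hat{\sigma}_n$ is an S-estimate [14] of scale. Consider the M-estimate
of scale $\hat{s}_n(\theta)$ defined as a solution to

$$\frac{1}{n} \sum_{i=1}^{n} \rho_0\!\left(\frac{y_i - \mathbf{z}_i^\top \theta}{\hat{s}_n(\theta)}\right) = m, \tag{9}$$

where $m = \rho_0(\infty)/2$ is a constant. Let $\tilde{\theta}_n$ be the argument
that minimizes $\hat{s}_n(\theta)$,

$$\tilde{\theta}_n = \arg\min_{\theta \in \mathbb{R}^p} \hat{s}_n(\theta),$$

then $\hat{\sigma}_n = \hat{s}_n(\tilde{\theta}_n)$.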

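We employ Tukey's loss function:

$$\rho_e(u) = \begin{cases} \dfrac{u^2}{2} - \dfrac{u^4}{2c_e^2} + \dfrac{u^6}{6c_e^4} & \text{for } |u| \le c_e, \\[2mm] \dfrac{c_e^2}{6} & \text{for } |u| > c_e, \end{cases}$$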

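which is widely used as the $\rho$ functions of the MM-estimator,
where the subscript $e$ represents different tunings of the function.
For instance, an MM-estimator with efficiency $O = 95\%$ and
breakdown point $BP = 50\%$ (i.e., for Gaussian errors) is
achievable by tuning $\rho_e(u)$ to $c_0 = 1.547$ and $c_1 = 4.685$
for $\rho_0$ and $\rho_1$ respectively (see [15, p. 142, tab. 19]). In this
paper (8) is computed using an iterative algorithm proposed
in [16]. The initial values of the iteration are obtained from (9),
which in turn are computed using the FastS algorithm [17].

In order to apply the BLFRB method to the MM-estimator, (8)
and (9) need to be presented in the form of FP equations
scalable to the number of distinct data points in the data. The
corresponding scalable one-step MM-estimates $\hat{\theta}_n^{1*}$ and $\hat{\sigma}_n^{1*}$
are obtained by modifying [?, eq. 17 and 18] as follows.
Let $\mathbf{X}^* = (\check{\mathbf{X}}; \mathbf{n}^*)$ denote a BLB bootstrap sample based on
subsample $\check{\mathbf{X}} = \{(\check{y}_1, \check{\mathbf{z}}_1^\top)^\top, \ldots, (\check{y}_b, \check{\mathbf{z}}_b^\top)^\top\}$, $\check{\mathbf{z}}_i \in \mathbb{R}^p$, and a
weight vector $\mathbf{n}^* = (n_1^* \cdots n_b^*) \in \mathbb{R}^b$:

$$\hat{\theta}_n^{1*} = \left(\sum_{i=1}^{b} n_i^* \omega_i \check{\mathbf{z}}_i \check{\mathbf{z}}_i^\top\right)^{-1} \sum_{i=1}^{b} n_i^* \omega_i \check{\mathbf{z}}_i \check{y}_i, \tag{10}$$

$$\hat{\sigma}_n^{1*} = \sum_{i=1}^{b} n_i^* \upsilon_i \left(\check{y}_i - \check{\mathbf{z}}_i^\top \tilde{\theta}_n\right), \tag{11}$$

where

$$r_i = \check{y}_i - \check{\mathbf{z}}_i^\top \hat{\theta}_n, \quad \tilde{r}_i = \check{y}_i - \check{\mathbf{z}}_i^\top \tilde{\theta}_n, \quad \omega_i = \frac{\rho_1'(r_i/\hat{\sigma}_n)}{r_i} \quad \text{and} \quad \upsilon_i = \frac{\hat{\sigma}_n}{nm} \frac{\rho_0(\tilde{r}_i/\hat{\sigma}_n)}{\tilde{r}_i}.$$

Fig. 1. The steps of the BLFRB procedure (Algorithm 1) are depicted. Disjoint subsamples of significantly smaller size $b$ are drawn from the original Big
Data set $\mathbf{X}$. The initial estimate $\hat{\theta}_{n,b}$ is obtained by solving the fixed-point estimating equation only once for each subsample $\check{\mathbf{X}}$. Within each module, the FRB
replicas $\hat{\theta}_{n,b}^{R*}$ are computed for each bootstrap sample $\mathbf{X}^*$ using the initial estimate $\hat{\theta}_{n,b}$. The final estimate of uncertainty $\hat{\xi}^*$ is obtained by averaging the
results of distinct subsample modules.

The BLFRB replications of $\hat{\theta}_n$ are obtained by applying
the FRB linear correction as in [?, eq. 20] to the one-step
estimators of (10) and (11).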
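
The one-step coefficient estimate in (10) is a weighted least-squares solve in
which each distinct data point carries the product of its multinomial count and
its robustness weight. The sketch below illustrates this under stated
assumptions: the function name `one_step_theta` and its argument names are
hypothetical, Tukey's loss with $c_1 = 4.685$ is used for $\rho_1$, and the
scale update (11) and the subsequent linear correction are not included.

```python
import numpy as np

def one_step_theta(Z, y, counts, theta_hat, sigma_hat, c1=4.685):
    """Weighted one-step regression estimate for one bootstrap weight vector.

    Z: (b, p) regressors of the bag; y: (b,) responses;
    counts: (b,) multinomial weights n*_i summing to n;
    theta_hat, sigma_hat: initial MM-estimates computed once on the bag.
    """
    u = (y - Z @ theta_hat) / sigma_hat
    # omega_i = rho_1'(r_i / sigma) / r_i is proportional to (1 - (u_i/c1)^2)^2
    # inside [-c1, c1]; the common factor 1/sigma cancels in the solve below.
    omega = np.where(np.abs(u) <= c1, (1.0 - (u / c1) ** 2) ** 2, 0.0)
    w = counts * omega
    A = Z.T @ (w[:, None] * Z)   # sum_i n*_i omega_i z_i z_i^T
    rhs = Z.T @ (w * y)          # sum_i n*_i omega_i z_i y_i
    return np.linalg.solve(A, rhs)

# Small usage example on noise-free data, where any weighting recovers theta.
rng = np.random.default_rng(0)
b, p, n = 200, 3, 5_000
Z = rng.standard_normal((b, p))
theta = np.ones(p)
y = Z @ theta
counts = rng.multinomial(n, np.full(b, 1.0 / b))
est = one_step_theta(Z, y, counts, theta, 1.0)
```

Only a $p \times p$ linear system is solved per bootstrap sample, and the sums
run over the $b$ distinct points rather than over $n$ resampled points.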

IV. Statistical properties 
Next we establish the asymptotic convergence and statistical 
robustness of the proposed BLFRB method. 

A. Statistical Convergence 

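We show that the asymptotic distribution of BLFRB replicas
in each bag is the same as that of the conventional bootstrap. Let
$\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a set of observed data as the outcome
of i.i.d. random variables $\mathcal{X} = \{X_1, \ldots, X_n\}$ from an
unknown distribution $P$. The empirical distribution (measure)
formed by $\mathbf{X}$ is denoted by a linear combination of the Dirac
measures at the observations, $P_n = n^{-1} \sum_{i=1}^{n} \delta_{\mathbf{x}_i}$. Let
$P_{n,b}^{(k)} = b^{-1} \sum_{i=1}^{b} \delta_{\check{\mathbf{x}}_i^{(k)}}$ and
$P_{n,b}^{*} = n^{-1} \sum_{i=1}^{b} n_i^* \delta_{\check{\mathbf{x}}_i^{(k)}}$ denote the
empirical distributions formed by subsample $\check{\mathbf{X}}^{(k)}$ and bootstrap
sample $\mathbf{X}^*$ respectively. We also use $\phi(\cdot)$ for functional
representations of the estimator, e.g., $\theta = \phi(P)$, $\hat{\theta}_{n,b}^{(k)} = \phi(P_{n,b}^{(k)})$
and $\hat{\theta}_{n,b}^{*} = \phi(P_{n,b}^{*})$. The notation $\overset{d}{=}$ denotes that both sides
have the same limiting distribution.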

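Theorem 4.1: Consider $P$, $P_n$ and $P_{n,b}^{(k)}$ as maps from a
Donsker class $\mathcal{F}$ to $\mathbb{R}$ such that
$\mathcal{F}_\delta = \{f - g : f, g \in \mathcal{F}, \rho_P(f - g) < \delta\}$ is measurable for every $\delta > 0$.
Let $\phi$ be Hadamard differentiable at $P$ tangentially to some
subspace and $\hat{\theta}_n$ be a solution to a system of smooth FP
equations. Then as $n, b \to \infty$

$$\sqrt{n}\,(\hat{\theta}_{n,b}^{R*} - \hat{\theta}_{n,b}^{(k)}) \overset{d}{=} \sqrt{n}\,(\hat{\theta}_n - \theta). \tag{12}$$

See the proof in the Appendix.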


B. Statistical robustness 

Consider the linear model (7) and let $\hat{\theta}_n$ be an estimator of
the parameter vector $\theta$ based on $\mathbf{X}$. Let $q_t$, $t \in (0,1)$, denote
the $t$th upper quantile of $[\hat{\theta}_n]_l$, where $[\hat{\theta}_n]_l$ is the $l$th element
of $\hat{\theta}_n$, $l = 1, \ldots, p$. In other words, $\Pr([\hat{\theta}_n]_l > q_t) = t$.
Here we study the robustness properties of the BLB and BLFRB
estimates of $q_t$. We focus only on the robustness properties of
one bag, as it is easy to see that the end results of both methods
break down if only one bag produces a corrupted estimate.

Let $\hat{q}_t^*$ denote the BLB or BLFRB estimate of $q_t$ based
on a random subsample $\check{\mathbf{X}}$ of size $b = \{\lfloor n^{\gamma} \rfloor \mid \gamma \in [0.6, 0.9]\}$
drawn from a big data set $\mathbf{X}$. Following [18], we define the
upper breakdown point of $\hat{q}_t^*$ as the minimum proportion of
asymmetric outlier contamination in subsample $\check{\mathbf{X}}$ that can
drive $\hat{q}_t^*$ over any finite bound.

Theorem 4.2: In the original BLB setting with the LS estimator,
only one outlying data point in a subsample $\check{\mathbf{X}}$ is sufficient to
drive $\hat{q}_t^*$, $t \in (0,1)$, over any finite bound, hence ruining the
end result of the whole scheme. See the proof in the Appendix.


Let $\mathbf{X} = \{(y_1, \mathbf{z}_1^\top)^\top, \ldots, (y_n, \mathbf{z}_n^\top)^\top\}$ be an observed data
set following the linear model (7). Assume that the explanatory
variables $\mathbf{z}_i \in \mathbb{R}^p$ are in general position [15, p. 117]. Let
$\hat{\theta}_n$ be an MM-estimate of $\theta$ based on $\mathbf{X}$. According to [10,
Theorem 2], the FRB estimate of the $t$th quantile of $[\hat{\theta}_n]_l$
remains bounded as long as $\hat{\theta}_n$ in equation (1) is a reliable
estimate of $\theta$ and more than $100(1 - t)\%$ of the bootstrap samples
contain at least $p$ good (i.e., non-outlying) data points. This
means that in FRB, higher quantiles are more robust than the
lower ones. Here we show that in a BLFRB bag the former
condition guarantees the latter.

Theorem 4.3: Let $\check{\mathbf{X}} = \{(\check{y}_1, \check{\mathbf{z}}_1^\top)^\top, \ldots, (\check{y}_b, \check{\mathbf{z}}_b^\top)^\top\}$ be a
subsample of size $b = \{\lfloor n^{\gamma} \rfloor \mid \gamma \in [0.6, 0.9]\}$ randomly drawn
from $\mathbf{X}$ following the linear model (7). Assume that the
explanatory variables $\check{\mathbf{z}}_i \in \mathbb{R}^p$ are in general position.
Let $\hat{\theta}_{n,b}$ be an MM-estimator of $\theta$ based on $\check{\mathbf{X}}$ and let $\delta_b$ be
the finite sample breakdown point of $\hat{\theta}_{n,b}$. Then in the BLFRB
bag formed by $\check{\mathbf{X}}$, all the estimated quantiles $\hat{q}_t^*$, $t \in (0,1)$,
have the same breakdown point equal to $\delta_b$. See the proof in
the Appendix.

TABLE I
Upper breakdown point of the BLFRB estimates of quantiles
for MM-regression estimator with 50% breakdown point and
95% efficiency at the Gaussian model.

  p      n          $\gamma = 0.6$   $\gamma = 0.7$   $\gamma = 0.8$
  50     50000      0.425            0.475            0.491
  50     200000     0.467            0.490            0.497
  50     1000000    0.488            0.497            0.499
  100    50000      0.349            0.449            0.483
  100    200000     0.434            0.481            0.494
  100    1000000    0.475            0.494            0.498
  200    50000      0.197            0.398            0.465
  200    200000     0.368            0.461            0.488
  200    1000000    0.450            0.487            0.497

Fig. 2. The true distribution of the right hand side of (12) along with the
obtained empirical distributions of the left hand side for two elements of $\hat{\theta}_{n,b}^{R*}$
with the best and the worst estimates.

Theorem 4.3 implies that in the BLFRB setting, lower quantiles
are as robust as higher ones, with breakdown point equal
to $\delta_b$, which can be set close to 0.5. This provides the maximum
possible statistical robustness for the quantile estimates. In the
proof we show that if $\hat{\theta}_{n,b}$ is a reliable MM-estimate of $\theta$,
then all the bootstrap samples of size $n$ drawn from $\check{\mathbf{X}}$ are
constrained to have at least $p$ good data points.

Table I illustrates the upper breakdown points of the
BLFRB estimates of quantiles for various dimensions of data
and different subsample sizes. The MM-regression estimator
is tuned to a 50% breakdown point and 95% efficiency at the
central model. The results reveal that BLFRB is significantly
more robust than the original BLB with the LS estimator. Another
important outcome of the table is that, when choosing the size
of subsamples $b = \lfloor n^{\gamma} \rfloor$, the dimension $p$ of the data should
be taken into account. For example, for a data set of size
$n = 50000$ and $p = 200$, setting $\gamma = 0.6$ or $0.7$ is not
the right choice.
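
The interplay between $\gamma$, $p$ and the breakdown point can be explored with a
short calculation. The sketch below uses the finite-sample S-estimator formula
quoted in the Appendix, $(\lfloor b/2 \rfloor - p + 2)/b$; the function names are
hypothetical, and because the exact counting convention behind Table I is not
stated, the printed values may differ from the table in the third decimal.

```python
import math

def bag_size(n: int, gamma: float) -> int:
    """Subsample size b = floor(n**gamma)."""
    return math.floor(n ** gamma)

def s_breakdown_point(b: int, p: int) -> float:
    """Finite-sample breakdown point (floor(b/2) - p + 2) / b of an S-estimator."""
    return (b // 2 - p + 2) / b

# Effect of gamma for n = 50000 and p = 200, the case discussed in the text.
for gamma in (0.6, 0.7, 0.8):
    b = bag_size(50_000, gamma)
    print(gamma, b, round(s_breakdown_point(b, 200), 3))
```

For fixed $n$, a small $\gamma$ gives a small bag, and the term $p$ in the numerator
then takes a visibly larger share of $\lfloor b/2 \rfloor$.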


V. Numerical Examples 

In this section the performance of the BLFRB method is
assessed by simulation studies. We also perform the simulations
with the original BLB method for comparison purposes.

A. Simulation studies 

We generate a simulated data set
$\mathbf{X} = \{(y_1, \mathbf{z}_1^\top)^\top, \ldots, (y_n, \mathbf{z}_n^\top)^\top\}$ of size $n = 50000$ following the
linear model $y_i = \mathbf{z}_i^\top \theta + \sigma_0 e_i$ ($i = 1, \ldots, n$), where the
explanatory variables $\mathbf{z}_i$ are generated from the $p$-variate normal
distribution $\mathcal{N}_p(\mathbf{0}, \mathbf{I}_p)$ with $p = 50$, the $p$-dimensional parameter
vector is $\theta = \mathbf{1}_p$, the noise terms are i.i.d. from the standard normal
distribution and the noise variance is $\sigma_0^2 = 0.1$.

The MM-estimator in the BLFRB scheme is tuned to have
efficiency $O = 95\%$ and breakdown point $BP = 50\%$. The
original BLB scheme in [1] uses the LS-estimator for computation
of the bootstrap estimates of $\theta$.

Fig. 3. The average of all $p$ BLFRB estimated distributions, along with the
true distribution. Note that the averaged empirical distribution converges to the
true cdf. This confirms the results of Theorem 4.1.


Here, we first verify the result of Theorem 4.1 in simulation
by comparing the distribution of the left hand side of (12)
with the right hand side. Given the above settings, the right
hand side of (12) follows $\mathcal{N}_p(\mathbf{0}, (\sigma_0^2/O)\,\mathbf{I}_p)$ in distribution [12,
Theorem 4.1]. We form the distribution of the left hand side
by drawing a random subsample $\check{\mathbf{X}}$ of size $b = \lfloor 50000^{0.7} \rfloor = 1946$
and performing steps 2 and 3 of the BLFRB procedure
(i.e., Algorithm 1) for $\check{\mathbf{X}}$ using $r = 1000$ bootstrap samples.
Fig. 2 shows the true distribution of $\sqrt{n}\,(\hat{\theta}_n - \theta)$ along with the
obtained empirical distributions of $\sqrt{n}\,(\hat{\theta}_{n,b}^{R*} - \hat{\theta}_{n,b})$ for two
elements of $\hat{\theta}_{n,b}^{R*}$, with the best and the worst outcomes. The
result of averaging all the $p$ empirical distributions is illustrated
in Fig. 3 along with the true distribution. Note that the results
are in conformity with Theorem 4.1.


Next, we compare the performance of the BLB and BLFRB
methods. We compute the bootstrap estimate of the standard deviation
(SD) of $\hat{\theta}_n$ by the two methods. In other words, the estimate
of uncertainty in step 4 of the procedure (see Fig. 1) for
bag $k$ is as follows:

$$\hat{\xi}_l^{*(k)} = \widehat{\mathrm{SD}}\big([\hat{\theta}_{n,b}^{(k)}]_l\big) = \left( \frac{1}{r - 1} \sum_{j=1}^{r} \Big( [\hat{\theta}_{n,b}^{*(kj)}]_l - [\bar{\theta}_{n,b}^{*(k)}]_l \Big)^2 \right)^{1/2},$$

where $[\hat{\theta}_{n,b}]_l$ denotes the $l$th element of $\hat{\theta}_{n,b}$ and
$[\bar{\theta}_{n,b}^{*(k)}]_l = \frac{1}{r} \sum_{j=1}^{r} [\hat{\theta}_{n,b}^{*(kj)}]_l$. Step 5 of the procedure for the $l$th
element of $\hat{\theta}_{n,b}$ is obtained by:

$$\hat{\xi}_l^{*} = \widehat{\mathrm{SD}}\big([\hat{\theta}_n]_l\big) = \frac{1}{s} \sum_{k=1}^{s} \widehat{\mathrm{SD}}\big([\hat{\theta}_{n,b}^{(k)}]_l\big), \quad l = 1, \ldots, p.$$

Fig. 4. Relative errors of the BLB (dashed line) and BLFRB (solid line)
methods w.r.t. the number of bootstrap samples $r$ are illustrated. Both methods
perform equally well when there are no outliers in the data.

The performance of the BLB and BLFRB methods is assessed by
computing a relative error defined as:

$$\varepsilon = \frac{\widehat{\mathrm{SD}}(\hat{\theta}_n) - \mathrm{SD}_0(\hat{\theta}_n)}{\mathrm{SD}_0(\hat{\theta}_n)},$$

where $\widehat{\mathrm{SD}}(\hat{\theta}_n) = \frac{1}{p} \sum_{l=1}^{p} \widehat{\mathrm{SD}}([\hat{\theta}_n]_l)$ and
$\mathrm{SD}_0(\hat{\theta}_n) = \sigma_0 / \sqrt{nO}$ is (an approximation of) the average standard deviation
of $\hat{\theta}_n$ based on the asymptotic covariance matrix [12] (i.e., $O$
is 0.95 for the MM-estimator and 1 for the LS-estimator). The
bootstrap setup is as follows: the number of disjoint subsamples
is $s = 25$, the size of each subsample is $b = \lfloor n^{\gamma} \rfloor = 1946$ with
$\gamma = 0.7$, and the maximum number of bootstrap samples in each
subsample module is $r_{\max} = 300$. We start from $r = 2$
and continually add a new set of bootstrap samples (while
$r < r_{\max}$) to the subsample modules. The convergence of the relative
errors w.r.t. the number of bootstrap samples $r$ is illustrated in
Fig. 4. Note that when the data is not contaminated by outliers,
both methods perform similarly in terms of achieving a lower
level of relative errors for a higher number of bootstrap samples.
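
The relative-error experiment can be imitated on a small scale for the BLB
scheme with the LS estimator, which only needs weighted least squares. The
sketch below is an illustration under stated assumptions, not the authors'
code: the sizes `n`, `p`, `s` and `r` are reduced so that it runs quickly, the
function name `blb_ls_sd` is hypothetical, and the magnitude of the relative
error is reported.

```python
import numpy as np

def blb_ls_sd(Z, y, s, gamma, r, rng):
    """BLB estimate of the average SD of the LS regression coefficients."""
    n, p = Z.shape
    b = int(np.floor(n ** gamma))
    bags = rng.permutation(n)[: s * b].reshape(s, b)   # s disjoint bags
    sd_bags = []
    for idx in bags:
        Zb, yb = Z[idx], y[idx]
        reps = np.empty((r, p))
        for j in range(r):
            w = rng.multinomial(n, np.full(b, 1.0 / b))   # weights sum to n
            A = Zb.T @ (w[:, None] * Zb)
            reps[j] = np.linalg.solve(A, Zb.T @ (w * yb)) # weighted LS replicate
        sd_bags.append(reps.std(axis=0, ddof=1))          # step 4, per coefficient
    return float(np.mean(sd_bags))   # step 5, then average over coefficients

rng = np.random.default_rng(0)
n, p, sigma0 = 20_000, 10, np.sqrt(0.1)
Z = rng.standard_normal((n, p))
y = Z @ np.ones(p) + sigma0 * rng.standard_normal(n)

sd_hat = blb_ls_sd(Z, y, s=5, gamma=0.7, r=50, rng=rng)
sd_0 = sigma0 / np.sqrt(n * 1.0)      # efficiency O = 1 for the LS estimator
rel_err = abs(sd_hat - sd_0) / sd_0   # magnitude of the relative error
```

Replacing the inner weighted least-squares solve with a full robust fit is what
makes plain BLB expensive for MM-estimators, which is the cost BLFRB avoids.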


We study the robustness properties of the methods using the
above settings. According to Theorem 4.2, only one outlying
data point is sufficient to drive the BLB estimates of $\widehat{\mathrm{SD}}(\hat{\theta}_n)$
over any finite bound. To introduce such an outlier, we randomly
choose one of the original data points and multiply it by
a large number $\alpha$. Such a contamination scenario resembles
misplacement of the decimal point in real world data sets.
The lack of robustness of the BLB method is illustrated in Fig. 5(a)
for $\alpha = 500$ and $\alpha = 1000$.

According to Table I, for the settings of our example
the upper breakdown point of the BLFRB quantile estimates is
$\delta_b = 0.475$. Let us assess the statistical robustness of the
BLFRB scheme by severely contaminating the original data
points of the first bag. We multiply 40% ($\lfloor 0.4 \times b \rfloor = 778$)
of the data points by $\alpha = 1000$. As shown in Fig. 5(b), BLFRB
still performs robustly despite such a proportion of outlying
data points.

Fig. 5. (a) Relative errors of the BLB method illustrating severe lack of
robustness in the face of only one outlying data point. (b) Relative errors of the
BLFRB method illustrating the reliable performance of the method in the face
of severely contaminated data. Horizontal axis: No. bootstrap samples (r).

Now, let us make an intuitive comparison between the
computational complexity of the BLB and BLFRB methods by
using the MM-estimator in both methods. We use an identical
computing system to compute the bootstrap standard deviation
(SD) of $\hat{\theta}_n$ by the two methods. The computed $\varepsilon$ and the
cumulative processing time are stored after each iteration
(i.e., adding a new set of bootstrap samples to the bags). Fig. 6
reports relative errors w.r.t. the required cumulative processing
time after each iteration of the algorithms. The BLFRB is
remarkably faster since the method avoids solving estimating
equations for each bootstrap sample.

B. Real world data 

Finally, we use the BLFRB method for bootstrap analysis
of a real large data set. We consider the simplified version
of the Million Song Dataset (MSD) [19], available on
the UCI Machine Learning Repository [20]. The data set
$\mathbf{X} = \{(y_1, \mathbf{z}_1^\top)^\top, \ldots, (y_n, \mathbf{z}_n^\top)^\top\}$ contains $n = 515345$ music
tracks, where $y_i$ ($i = 1, \ldots, n$) represents the release
year of the $i$th song (ranging from 1922 to 2011) and
$\mathbf{z}_i \in \mathbb{R}^p$ is a vector of $p = 90$ different audio features of each
song. The features used are the average and non-redundant
covariance values of the timbre vectors of the song.

Fig. 6. Relative errors $\varepsilon$ w.r.t. the required processing time of each BLB
and BLFRB iteration. The BLFRB is significantly faster to compute as the
(re)computation of the estimating equations is not needed in this method.

Fig. 7. The 95% confidence intervals computed by the BLFRB method are shown
for some of the audio features of the MSD data set. The null hypothesis is
accepted for those features having 0 inside the interval.

Linear regression can be used to predict the release
year of a song based on its audio features. We use the BLFRB
method to conduct a fast, robust and scalable bootstrap test
on the regression coefficients. In other words, considering the
linear model $y_i = \mathbf{z}_i^\top \theta + \sigma_0 e_i$, we use BLFRB for testing the
hypothesis $H_0: [\theta]_l = 0$ vs. $H_1: [\theta]_l \neq 0$, where $[\theta]_l$
($l = 1, \ldots, p$) denotes the $l$th element of $\theta$. The BLFRB
test of level $\alpha$ rejects the null hypothesis if the computed
$100(1 - \alpha)\%$ confidence interval does not contain 0. Here we
run the BLFRB hypothesis test of level $\alpha = 0.05$ with the
following bootstrap setup: the number of disjoint subsamples is
$s = 51$, the size of each subsample is $b = \lfloor n^{\gamma} \rfloor = 9964$ with
$\gamma = 0.7$, and the number of bootstrap samples in each subsample module
is $r = 500$. Among all the 90 features, the null hypothesis is
accepted only for 6 features, numbered 32, 40, 44, 47, 54 and 75.
Fig. 7 shows the computed 95% CIs of the features. In order
to provide a closer view, we have only shown the results for
feature numbers 30 to 80. These results can be exploited to
reduce the dimension of the data by excluding the ineffective
variables from the regression analysis.
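
The decision rule of the test can be sketched as follows. The text does not
state how the confidence interval is formed from the bootstrap replicas, so
this example assumes a percentile interval within each bag with the endpoints
averaged over bags; both that choice and the function names are assumptions
made for illustration only.

```python
import numpy as np

def percentile_ci(replicas, alpha=0.05):
    """Percentile interval from bootstrap replicas (an assumed CI construction)."""
    lo, hi = np.quantile(replicas, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def bootstrap_test(replicas_per_bag, alpha=0.05):
    """Average interval endpoints over bags; reject H0 if 0 lies outside."""
    cis = np.array([percentile_ci(rep, alpha) for rep in replicas_per_bag])
    lo, hi = cis.mean(axis=0)
    reject = not (lo <= 0.0 <= hi)
    return (float(lo), float(hi)), reject

# Synthetic replicas standing in for one regression coefficient in 5 bags.
rng = np.random.default_rng(0)
far_from_zero = [1.0 + 0.1 * rng.standard_normal(500) for _ in range(5)]
around_zero = [0.1 * rng.standard_normal(500) for _ in range(5)]
ci_far, reject_far = bootstrap_test(far_from_zero)
ci_zero, reject_zero = bootstrap_test(around_zero)
```

In the setting of the text, the same rule would be applied once per coefficient
$l = 1, \ldots, p$.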

Using the BLFRB method with the highly robust MM-estimator,
we ensure that the computational process is done
in a reasonable time frame and the results are not affected by
possible outliers in the data. Such desirable properties are not
offered by the other methods considered in our comparisons.

VI. Conclusion 

In this paper, a new robust, scalable and low complexity
bootstrap method is introduced with the aim of finding parameter
estimates and confidence measures for very large scale
data sets. The statistical properties of the method, including
convergence and robustness, are established using analytical
methods. While the proposed BLFRB method is fully scalable
and compatible with distributed computing systems, it is
remarkably faster and significantly more robust than the original
BLB method [1].


Appendix A 

Here we provide the proofs of the theoretical results of
Sections IV-A and IV-B.


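Proof of Theorem 4.1: Given that $\mathcal{F}$ is a Donsker class, as
$n \to \infty$:

$$G_n = \sqrt{n}\,(P_n - P) \xrightarrow{d} G_P, \tag{13}$$

where $G_P$ is the $P$-Brownian bridge process and the notation
$\xrightarrow{d}$ denotes convergence in distribution. According to [21,
Theorem 3.6.3], as $b, n \to \infty$:

$$G_{n,b}^{*} = \sqrt{n}\,(P_{n,b}^{*} - P_{n,b}^{(k)}) \xrightarrow{d} G_P. \tag{14}$$

Thus, $G_n$ and $G_{n,b}^{*}$ converge in distribution to the same
limit. The functional delta method for the bootstrap [22, Theorem
23.9] in conjunction with [23, Lemma 1] imply that, for every
Hadamard-differentiable function $\phi$:

$$\sqrt{n}\,(\phi(P_n) - \phi(P)) \xrightarrow{d} \phi_P'(G_P), \tag{15}$$

and conditionally on $P_{n,b}^{(k)}$,

$$\sqrt{n}\,(\phi(P_{n,b}^{*}) - \phi(P_{n,b}^{(k)})) \xrightarrow{d} \phi_P'(G_P), \tag{16}$$

where $\phi_P'(G_P)$ is the derivative of $\phi$ w.r.t. $P$ at $G_P$. Thus,
conditionally on $P_{n,b}^{(k)}$,

$$\sqrt{n}\,(\hat{\theta}_{n,b}^{*} - \hat{\theta}_{n,b}^{(k)}) \overset{d}{=} \sqrt{n}\,(\hat{\theta}_n - \theta). \tag{17}$$

According to [?, equations 5 and 6] and given that $\hat{\theta}_n$ can be
expressed as a solution to a system of smooth FP equations:

$$\sqrt{n}\,(\hat{\theta}_{n,b}^{R*} - \hat{\theta}_{n,b}^{(k)}) \overset{d}{=} \sqrt{n}\,(\hat{\theta}_{n,b}^{*} - \hat{\theta}_{n,b}^{(k)}). \tag{18}$$

From (17) and (18):

$$\sqrt{n}\,(\hat{\theta}_{n,b}^{R*} - \hat{\theta}_{n,b}^{(k)}) \overset{d}{=} \sqrt{n}\,(\hat{\theta}_n - \theta), \tag{19}$$

which concludes the proof. □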

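The following lemma is needed to prove Theorems 4.2 and
4.3.

Lemma 1: Let $\check{\mathbf{X}} = (\check{\mathbf{x}}_1 \cdots \check{\mathbf{x}}_b)$ be a subset of size
$b = \{\lfloor n^{\gamma} \rfloor \mid \gamma \in [0.6, 0.9]\}$ randomly resampled without
replacement from a big data set $\mathbf{X}$ of size $n$. Let $\mathbf{X}^*$ be a bootstrap
sample of size $n$ randomly resampled with replacement from
$\check{\mathbf{X}}$ (or equivalently formed by assigning a random weight
vector $\mathbf{n}^* = (n_1^*, \ldots, n_b^*)$ from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to
columns of $\check{\mathbf{X}}$). Then:

$$\lim_{n \to \infty} \Pr\{\check{\mathbf{x}}_{(i)} \notin \mathbf{X}^* \text{ for any } \check{\mathbf{x}}_{(i)} \in \check{\mathbf{X}}\} = 0.$$

Proof of Lemma 1: Consider an arbitrary data point $\check{\mathbf{x}}_{(i)} \in \check{\mathbf{X}}$.
The probability that $\check{\mathbf{x}}_{(i)}$ does not occur in a bootstrap
sample of size $n \to \infty$ is:

$$\lim_{n \to \infty} \Pr\Big(\mathrm{Binomial}\big(n, \tfrac{1}{n^{\gamma}}\big) < 1\Big)
= \lim_{n \to \infty} \Big(1 - \frac{1}{n^{\gamma}}\Big)^{n}
= \lim_{n \to \infty} \exp\left(\frac{\ln\big(1 - \frac{1}{n^{\gamma}}\big)}{1/n}\right)
= \lim_{n \to \infty} \exp\left(\frac{-\gamma n^{2}}{n^{\gamma + 1} - n}\right) = 0. \qquad \square$$

Such probability for $n = 20000$ and $\gamma = 0.7$ is $3.3 \times 10^{-9}$.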
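
The numerical value quoted above can be checked directly from the expression
$(1 - 1/n^{\gamma})^n$ used in the proof. The helper name `prob_point_missing` is
hypothetical, and $n^{\gamma}$ is used without rounding to an integer, as in the
proof.

```python
def prob_point_missing(n: int, gamma: float) -> float:
    """(1 - 1/n**gamma)**n: chance that a given bag point is absent
    from a bootstrap sample of size n drawn from a bag of size n**gamma."""
    return (1.0 - n ** (-gamma)) ** n

prob = prob_point_missing(20_000, 0.7)
print(f"{prob:.2e}")
```

With these inputs the result is about $3.3 \times 10^{-9}$, so in practice every
bootstrap sample contains every point of its bag, including any outlier.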

Proof of Theorem 4.2: Let $\check{\mathbf{x}}_{(i)} \in \check{\mathbf{X}}$ be an outlying data
point in $\check{\mathbf{X}}$. According to Lemma 1, all bootstrap samples
drawn from that subsample will be contaminated by $\check{\mathbf{x}}_{(i)}$
(with probability tending to one as $n$ grows). This
is sufficient to break all the LS replicas of the estimator in
that bag, consequently ruining the end result of the whole
scheme. □

Proof of Theorem 4.3:

According to [10, Theorem 2], the FRB estimate of $q_t$
remains bounded as long as:

1. $\hat{\theta}_n$ in equation (1) is a reliable estimate of $\theta$, and

2. More than $100(1 - t)\%$ of the bootstrap samples contain at
least $p$ "good" (i.e., non-outlying) data points.

The first condition implies that, in a BLFRB bag, if $\hat{\theta}_{n,b}$ is a
corrupted estimate then all bootstrap estimates $\hat{q}_t^*$, $t \in (0,1)$,
will break as well. In the rest of the proof we show that if the
percentage of outliers in $\check{\mathbf{X}}$ is such that $\hat{\theta}_{n,b}$ is still a reliable
estimate of $\theta$, then all the bootstrap samples drawn from $\check{\mathbf{X}}$
contain at least $p$ good (non-outlying) data points. This suffices
for all $\hat{q}_t^*$, $t \in (0,1)$, to remain bounded.

Let $\hat{\theta}_{n,b}$ be an MM-estimate of $\theta$. Let the initial scale of the
MM-estimator be obtained by a high breakdown point S-estimator.
The finite sample breakdown point of the S-estimator for a
subsample of size $b$ is as follows:

$$\delta_b = \frac{\lfloor b/2 \rfloor - p + 2}{b}$$

[15, Theorem 8]. Given that $\hat{\theta}_{n,b}$ is a reliable estimate, the
initial S-estimate of the scale parameter is not broken. This
implies that there exist at least $h = b - \lfloor b/2 \rfloor + p - 1$ good data
points in general position in $\check{\mathbf{X}}$. It is easy to see that $h > p$.
Applying Lemma 1 to each of the good points shows that
the probability of drawing a bootstrap sample of size $n$ with
fewer than $p$ good data points goes to zero for large $n$, which
is the case for big data sets. □